DI Architecture & Components

High-Level DI Architecture

 

Components

There are several functional components which comprise the DI (Data Integration) layer. The DI layer is a vital part of the Digital Integration Hub (DIH) platform and is responsible for a wide range of data integration tasks, such as ingesting data in batches or streaming data changes in real time from various sources and systems of record (SoR). The ingested data then resides in the In-Memory Data Grid (IMDG), or Space, of the GigaSpaces Smart DIH platform. All of these components serve ongoing DI operations as part of the data integration layer of the Smart DIH platform.

The list below summarizes all the key DI layer components.

CDC (IIDR) Source Database Agent

Purpose: Captures the changes (CDC, Change Data Capture) from the source System of Record, for example Oracle, Db2 or MSSQL.

Details: IIDR (IBM InfoSphere Data Replication) is a solution that efficiently captures and replicates data, and changes made to the data, on relational source systems. It is used to move data from databases to the In-Memory Data Grid. The source database agent is a Java application that is usually installed on the source database server, and it captures changes from the source database transaction log files in real time.

Ports:
  IIDR Oracle agent: 11001
  IIDR Db2 z/OS agent: 11801
  IIDR Db2 AS-400 agent: 11111
  IIDR MSSQL agent: 10501

IIDR Target Kafka Agent

Purpose: Writes the changes captured by the IIDR source agent to Kafka.

Details: The IIDR Kafka agent is a Java application that runs on a Linux machine and writes the changes captured by the IIDR source agent to Kafka.

Port: 11710

IIDR Access Server

Purpose: IIDR administration service and metadata manager.

Details: The IIDR Access Server is responsible for creating all logical IIDR entities and objects, such as subscriptions and data stores. All metadata is stored in the internal IIDR database (PointBase).

Port: 10101

DI Manager

Purpose: The primary interface that controls all DI components.

Details: A web service that exposes REST (REpresentational State Transfer) APIs to:

1) Create a pipeline and a source database connection

2) Stop / start a pipeline

3) Perform other administration tasks

Port: 6080
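As an illustration of the kind of call a client might make against the DI Manager's REST interface, the sketch below starts a pipeline over plain HTTP from Java. The host name, endpoint path and pipeline name are assumptions made for this example only; consult the DI Manager API reference for the actual routes and payloads.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StartPipelineExample {
    public static void main(String[] args) throws Exception {
        // The DI Manager listens on port 6080; the host name and endpoint
        // path below are hypothetical, not the documented route.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://di-manager-host:6080/api/v1/pipelines/demo-pipeline/start"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        // Send the request and print the HTTP status and response body.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```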

DI MDM (Metadata Manager)

Purpose: Stores and retrieves metadata in Zookeeper.

Details: A web service that exposes a REST API and communicates with the DI Manager. It stores and retrieves metadata that is essential for DI operation:

1) The data dictionary of tables, columns and indexes

2) Pipeline configuration

3) Other important metadata records that are required for ongoing DI operations

Port: 6081

DI Processor

Purpose: A Java library run by Flink (Apache Flink, an open-source stream- and batch-processing framework) as a job; it is responsible for writing changes to the Space.

Details: A Java library that is deployed to Flink and invoked as a Flink job. Its main responsibility is to read messages from Kafka, transform each Kafka message into a Space document, and write the change to the relevant Space object (see the sketch after this component list).
Zookeeper (ZK)

Purpose: Serves as a persistent data store for DI components and as the Zookeeper instance required by Kafka.

Details: Apache Zookeeper is an open-source server for highly reliable distributed coordination, providing centralized configuration, naming, synchronization and group services. ZK runs on 3 nodes for H/A purposes, and ZK data is replicated between all the nodes.

Port: 2181

Kafka

Purpose: Serves as the stream-processing platform. Apache Kafka is a distributed event store and publish-subscribe messaging system: producers write messages to topics, topics are divided into partitions that function like logs, each message has a unique offset, and consumers can begin reading from a particular offset.

Details: Kafka is deployed as a cluster of 3 nodes and uses Zookeeper as its dependency. IIDR publishes changes to a Kafka topic, and the DI Processor (Flink job) consumes these messages and writes the changes to the Space.

Port: 9092
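To make the Kafka-to-Space leg of this flow concrete, the following is a minimal sketch, under stated assumptions, of a Flink job that consumes JSON change messages from a Kafka topic and writes each one to the Space as a SpaceDocument. The topic name, document type, Space name and field mapping are illustrative assumptions only; the actual DI Processor derives its transformation from the pipeline metadata held in MDM.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.SpaceProxyConfigurer;
import com.gigaspaces.document.SpaceDocument;
import com.gigaspaces.metadata.SpaceTypeDescriptorBuilder;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class KafkaToSpaceJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume raw change messages from a Kafka topic (broker and topic names are assumed).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("cdc.customer.changes")          // assumed topic name
                .setGroupId("di-processor-example")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "cdc-topic")
           .addSink(new SpaceSink());

        env.execute("kafka-to-space-example");
    }

    /** Writes each JSON change record to the Space as a SpaceDocument. */
    static class SpaceSink extends RichSinkFunction<String> {
        private transient GigaSpace gigaSpace;
        private transient ObjectMapper mapper;

        @Override
        public void open(Configuration parameters) {
            // Connect to a Space named "demo" (assumed name) and register an
            // example document type with "id" as its ID property.
            gigaSpace = new GigaSpaceConfigurer(new SpaceProxyConfigurer("demo")).gigaSpace();
            gigaSpace.getTypeManager().registerTypeDescriptor(
                    new SpaceTypeDescriptorBuilder("Customer").idProperty("id").create());
            mapper = new ObjectMapper();
        }

        @Override
        public void invoke(String message, Context context) throws Exception {
            // Map the Kafka message into a Space document (type and fields are assumed).
            JsonNode json = mapper.readTree(message);
            SpaceDocument doc = new SpaceDocument("Customer")
                    .setProperty("id", json.get("ID").asText())
                    .setProperty("name", json.get("NAME").asText());
            gigaSpace.write(doc);
        }
    }
}
```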

High-Level Data Flow

DI Subscription Manager

The DI Subscription Manager is a web service that exposes a set of APIs. Its unified API controls the CDC components; only the CDC components are in direct contact with the SoR.

 

 

DI Subscription Manager is a micro-service that is responsible for providing the following functionality:

1. A unified API that controls various CDC engines, implementing the GigaSpaces pluggable connector vision. It creates and updates IIDR entities:

  • Defines CDC flows and entities, such as a new subscription

  • Starts / stops the subscription data flow via IIDR

  • Monitors the status of the IIDR components

2. A unified method to extract the data dictionary from various sources, such as the CDC engine, source database, schema registry or enterprise data catalog, and to populate the DI internal data dictionary repository (MDM)

3. Data dictionary extraction from the IIDR.

  • Significantly simplifies DI operations

  • Only IIDR components connect to the source database

  • Data dictionary extraction is unified, regardless of the source database type

StreamSQL

StreamSQL allows Smart DIH users to implement a low-code (SQL) approach to defining and operating ad-hoc data flows, such as reading from Kafka and writing directly to the Space, or reading from one Kafka topic and writing to another Kafka topic.

Behind the scenes, StreamSQL utilizes powerful low-code Flink capabilities, defining schemas via the SQL CREATE TABLE API.

 

StreamSQL operation activities can be defined using SQL statements, for example:

  1. Define the structure of messages in a Kafka topic as a table (CREATE TABLE)

  2. Define a data flow (a stream of data, or pipeline) as an INSERT AS SELECT statement

  3. Perform a join of data flows from different Kafka topics using a standard SQL join statement

 

One useful StreamSQL use case is IoT, where a continuous flow of sensor data changes is consumed from Kafka, aggregated into a summary table, and the aggregated summary is pushed to the Space for consumption by data services, as sketched below.
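The following is a minimal sketch of such a flow expressed as Flink-style SQL. The table names, columns, topic and connector options are assumptions made for this example; the exact DDL accepted by StreamSQL is defined through SpaceDeck, and the target sensor_summary table is assumed to have been declared in the same way as a table backed by the Space.

```sql
-- Expose a Kafka topic of raw sensor readings as a table
-- (topic, schema and connector options are assumed for the example).
CREATE TABLE sensor_readings (
    sensor_id   STRING,
    temperature DOUBLE,
    event_time  TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic'     = 'iot.sensor.readings',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format'    = 'json'
);

-- Continuously aggregate readings per sensor over 1-minute windows and
-- push the summary into sensor_summary (assumed to be backed by the Space).
INSERT INTO sensor_summary
SELECT
    sensor_id,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    AVG(temperature) AS avg_temperature,
    MAX(temperature) AS max_temperature
FROM sensor_readings
GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```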

SpaceDeck

For information about how StreamSQL is implemented in SpaceDeck, the GigaSpaces streamlined user interface for setting up, managing and controlling the environment and for bringing legacy System of Record (SoR) databases into the in-memory data grid, refer to StreamSQL.